Language Variety


MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Whitehouse, Chenxi, Ruder, Sebastian, Lin, Tony, Kurylo, Oksana, Takagi, Haruka, Lam, Janice, Busetto, Nicolò, Diaz, Denise, Guzmán, Francisco

arXiv.org Artificial Intelligence

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

Dataset: https://huggingface.co/datasets/facebook/menlo

In order for LLMs to be most useful across the globe, they need to be able to provide high-quality responses in many languages. Responses should be relevant (Zhuang et al., 2024), factually accurate (Jacovi et al., 2025), and natural (Marchisio et al., 2024; Guo et al., 2025), among other considerations. Ultimately, for interaction in any language to be seamless, responses need to be indistinguishable from those of a native speaker (Novikova et al., 2016; Liu et al., 2021). Language proficiency in humans has traditionally been evaluated via standardized tests (Jamieson et al., 2000). While such tests have been applied to evaluating LLMs (Anil et al., 2023; Mayor-Rocher et al., 2024; Lothritz & Cabot, 2025), they are difficult to scale and do not readily correspond to real-world conversations.
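The pairwise judging setup described above can be sketched minimally as follows. The rubric text and the `judge` function are stand-ins of our own (a real judge would be an LLM call conditioned on the rubric; here a toy heuristic prefers the longer response), so this illustrates only the evaluation shape, not the paper's actual judge.

```python
# Minimal sketch of pairwise LLM-judge evaluation with a structured rubric.
# RUBRIC and judge() are illustrative stand-ins, not the paper's prompts.

RUBRIC = (
    "Compare responses A and B for native-like quality: "
    "fluency, naturalness, and appropriate register. Answer 'A' or 'B'."
)

def judge(prompt: str, a: str, b: str) -> str:
    # Placeholder for an LLM call; a real judge would condition on RUBRIC
    # and the prompt. This toy heuristic simply prefers the longer response.
    return "A" if len(a) >= len(b) else "B"

def pairwise_win_rate(pairs) -> float:
    """Fraction of pairs where the candidate response (A) is preferred.
    Each item is (prompt, response_a, response_b)."""
    wins = sum(judge(p, a, b) == "A" for p, a, b in pairs)
    return wins / len(pairs)

pairs = [
    ("Ask for the salt politely.", "Could you please pass the salt?", "Pass salt."),
    ("Greet a colleague.", "Hi!", "Good morning, how are you today?"),
]
print(pairwise_win_rate(pairs))  # 0.5 with this toy judge
```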


The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties

Chen, William, Meng, Chutong, Shi, Jiatong, Bartelds, Martijn, Wang, Shih-Heng, Wang, Hsiu-Hsuan, Mosquera, Rafael, Hincapie, Sara, Jurafsky, Dan, Anastasopoulos, Antonis, Lee, Hung-yi, Livescu, Karen, Watanabe, Shinji

arXiv.org Artificial Intelligence

Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.
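The character error rate (CER) used above is standardly computed as character-level Levenshtein distance divided by reference length; a minimal self-contained sketch (not the challenge's scoring code, which may also apply text normalization):

```python
# Character error rate (CER): Levenshtein distance over characters
# divided by the reference length.

def edit_distance(ref: str, hyp: str) -> int:
    # Classic dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

print(cer("hello world", "helo world"))  # 1 edit / 11 chars ~ 0.0909
```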


Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish

Phan, Nhan, Kuronen, Mikko, Kautonen, Maria, Ullakonoja, Riikka, von Zansen, Anna, Getman, Yaroslav, Voskoboinik, Ekaterina, Grósz, Tamás, Kurimo, Mikko

arXiv.org Artificial Intelligence

Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers' spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after inference to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model's Recall (77.5%) and Precision (17.6%).
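The post-inference calibration steps named above (temperature scaling of logits, then top-k renormalization of the probabilities) can be sketched as below. The temperature value and k are illustrative choices, not the paper's settings:

```python
# Hedged sketch: temperature-scaled softmax followed by top-k
# renormalization of the resulting probabilities.
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax; temperature > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_normalize(probs, k=3):
    # Keep the k largest probabilities and renormalize them to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

logits = [2.0, 1.0, 0.5, -1.0]           # illustrative phone logits
probs = softmax(logits, temperature=2.0)  # temperature chosen for the demo
print(top_k_normalize(probs, k=3))
```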


Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

Mamasaidov, Mukhammadsaid, Aral, Azizullah, Shopulatov, Abror, Inomjonov, Mironshoh

arXiv.org Artificial Intelligence

Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
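The half-space character in Arabic script is the zero-width non-joiner (ZWNJ, U+200C), which marks morphological boundaries without a visible space. The rule below (re-inserting ZWNJ before a small suffix list) is a hypothetical simplification for illustration only, not the paper's post-processing method, and the suffix chosen is an invented example:

```python
# Illustrative sketch of half-space (ZWNJ, U+200C) restoration at
# morphological boundaries. The suffix list is hypothetical, not the
# paper's rules.

ZWNJ = "\u200c"
SUFFIXES = ("لر",)  # invented suffix used purely for demonstration

def restore_half_space(word: str) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) and ZWNJ not in word:
            # Re-insert the half-space between stem and suffix.
            return word[: -len(suf)] + ZWNJ + suf
    return word

print(restore_half_space("کتابلر"))  # ZWNJ inserted before the suffix
```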


TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa

Riley, Parker, Shakeri, Siamak, Ammar, Waleed, Clark, Jonathan H.

arXiv.org Artificial Intelligence

We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models' abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present the performance of two baseline models, and release our code and data to facilitate further improvement by the research community.


In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties

Roll, Nathan, Graham, Calbert, Tatsumi, Yuka, Nguyen, Kim Tien, Sumner, Meghan, Jurafsky, Dan

arXiv.org Artificial Intelligence

Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided, though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
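The relationship between the two improvement figures quoted above (a 1.2 percentage-point absolute drop corresponding to roughly 19.7% relative) can be checked with simple arithmetic; the baseline WER below is chosen to be consistent with those numbers, not taken from the paper:

```python
# Relating absolute (percentage-point) and relative WER reductions.
# The baseline value is illustrative, back-derived from the quoted figures.

def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    return (baseline_wer - adapted_wer) / baseline_wer

baseline = 0.061            # illustrative baseline WER (~6.1%)
adapted = baseline - 0.012  # 1.2 pp absolute improvement
print(round(relative_reduction(baseline, adapted), 3))  # 0.197
```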


Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

Tjuka, Annika, Forkel, Robert, Rzymski, Christoph, List, Johann-Mattis

arXiv.org Artificial Intelligence

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with an enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.


Designing Speech Technologies for Australian Aboriginal English: Opportunities, Risks and Participation

Hutchinson, Ben, Louro, Celeste Rodríguez, Collard, Glenys, Cooper, Ned

arXiv.org Artificial Intelligence

In Australia, post-contact language varieties, including creoles and local varieties of international languages, emerged as a result of forced contact between Indigenous communities and English speakers. These contact varieties are widely used, yet are poorly supported by language technologies. This gap presents barriers to participation in civil and economic society for Indigenous communities using these varieties, and reproduces minoritisation of contemporary Indigenous sociolinguistic identities. This paper concerns three questions regarding this context. First, can speech technologies support speakers of Australian Aboriginal English, a local indigenised variety of English? Second, what risks are inherent in such a project? Third, what technology development practices are appropriate for this context, and how can researchers integrate meaningful community participation in order to mitigate risks? We argue that opportunities do exist -- as well as risks -- and demonstrate this through a case study exploring design practices in a real-world project aiming to improve speech technologies for Australian Aboriginal English. We discuss how we integrated culturally appropriate and participatory processes throughout the project. We call for increased support for languages used by Indigenous communities, including contact varieties, which provide practical economic and socio-cultural benefits, provided that participatory and culturally safe practices are enacted.


Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists

Snee, David, Ciucci, Luca, Rubehn, Arne, van Dam, Kellen Parker, List, Johann-Mattis

arXiv.org Artificial Intelligence

Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.
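The agreement statistic described above (the share of concepts whose independently elicited translations yield the same word form) reduces to a simple overlap computation; the wordlists below are toy examples, not data from the study:

```python
# Share of shared concepts whose translations agree across two
# independently compiled wordlists. Toy data, not the study's datasets.

def agreement_rate(wordlist_a: dict, wordlist_b: dict) -> float:
    shared = set(wordlist_a) & set(wordlist_b)
    same = sum(wordlist_a[c] == wordlist_b[c] for c in shared)
    return same / len(shared)

a = {"TREE": "baum", "WATER": "wasser", "STONE": "stein", "DOG": "hund"}
b = {"TREE": "baum", "WATER": "wasser", "STONE": "fels", "DOG": "hund"}
print(agreement_rate(a, b))  # 0.75
```

The same function applied to phonetic transcriptions instead of orthographic forms would give the stricter identical-transcription rate the study also reports.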


Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Sousa, Hugo, Almeida, Rúben, Silvano, Purificação, Cantante, Inês, Campos, Ricardo, Jorge, Alípio

arXiv.org Artificial Intelligence

Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open-source the code, corpus, and models to foster further research in this task.
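Framed as a task, LVI is binary text classification between varieties. The sketch below uses a trivial keyword cue as a stand-in for the transformer-based classifiers studied above; the cue words are real lexical differences (pt-BR "ônibus" vs. pt-PT "autocarro") but the classifier itself is purely illustrative:

```python
# Toy language-variety identification (LVI) as binary classification.
# A keyword heuristic stands in for the transformer classifiers; the cue
# sets are tiny illustrative samples of real lexical differences.

CUES_BR = {"você", "ônibus"}   # cues more typical of Brazilian Portuguese
CUES_EU = {"autocarro", "tu"}  # cues more typical of European Portuguese

def classify_variety(text: str) -> str:
    tokens = set(text.lower().split())
    br_hits = len(tokens & CUES_BR)
    eu_hits = len(tokens & CUES_EU)
    return "pt-BR" if br_hits >= eu_hits else "pt-PT"

print(classify_variety("o autocarro chega tarde"))  # pt-PT
```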